# Image caption generation

**Gemma 3 4b It Qat 4bit** (mlx-community; license: Other; 607 downloads, 1 like)
Gemma 3 4B IT QAT 4bit is a 4-bit quantized large language model trained with Quantization-Aware Training (QAT), based on the Gemma 3 architecture and optimized for the MLX framework.
Tags: Image-to-Text, Transformers, Other

**Florence 2 Base Gpt4 Captioner V1** (Vimax97; license: MIT; 224 downloads, 0 likes)
A GPT-4o-style caption generator fine-tuned from Florence-2-base-ft for generating image descriptions.
Tags: Image-to-Text, Transformers, Supports Multiple Languages

**Llama 3.2 11B Vision Instruct Nf4** (SeanScripts; 658 downloads, 12 likes)
A 4-bit (NF4) quantized version of meta-llama/Llama-3.2-11B-Vision-Instruct, supporting image understanding and text generation tasks.
Tags: Image-to-Text, Transformers

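A minimal loading sketch for this kind of NF4 setup, assuming the transformers and bitsandbytes libraries: the gated meta-llama base checkpoint is quantized on the fly here, while the pre-quantized repository from the listing should load directly via `from_pretrained` without an extra quantization config.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, BitsAndBytesConfig, MllamaForConditionalGeneration

# NF4 4-bit quantization: the listed repo ships weights already stored this way.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

base_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"  # gated; requires access approval
model = MllamaForConditionalGeneration.from_pretrained(
    base_id, quantization_config=bnb_config, device_map="auto"
)
processor = AutoProcessor.from_pretrained(base_id)

image = Image.open("photo.jpg")
messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "Describe this image in one sentence."},
]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(image, prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=80)
print(processor.decode(out[0], skip_special_tokens=True))
```
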
**Tvl Mini 0.1** (2Vasabi; license: Apache-2.0; 23 downloads, 2 likes)
A LoRA fine-tune of the Qwen2-VL-2B model for Russian, supporting multimodal tasks.
Tags: Image-to-Text, Transformers, Supports Multiple Languages

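A minimal sketch of how such a LoRA adapter is typically attached to the Qwen2-VL-2B base with the peft library; the adapter repository path below is a hypothetical placeholder inferred from the listing.

```python
from peft import PeftModel
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

base_id = "Qwen/Qwen2-VL-2B-Instruct"
adapter_id = "2Vasabi/tvl-mini-0.1"  # hypothetical path inferred from the listing

# Load the base vision-language model, then attach the LoRA adapter weights on top.
model = Qwen2VLForConditionalGeneration.from_pretrained(base_id, device_map="auto")
model = PeftModel.from_pretrained(model, adapter_id)
processor = AutoProcessor.from_pretrained(base_id)
```

After attaching the adapter, inference follows the usual Qwen2-VL chat-template flow; `merge_and_unload()` can bake the adapter into the base weights for deployment.
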
**Zcabnzh Bp** (nanxiz; license: BSD-3-Clause; 19 downloads, 0 likes)
BLIP is a unified vision-language pretraining framework that excels at tasks such as image caption generation and visual question answering, with performance enhanced by an innovative data-filtering mechanism.
Tags: Image-to-Text, Transformers

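A minimal captioning sketch using the upstream BLIP checkpoint from Salesforce, which this entry appears to repackage; assumes the transformers and Pillow packages.

```python
from PIL import Image
from transformers import BlipForConditionalGeneration, BlipProcessor

model_id = "Salesforce/blip-image-captioning-base"  # upstream BLIP checkpoint
processor = BlipProcessor.from_pretrained(model_id)
model = BlipForConditionalGeneration.from_pretrained(model_id)

image = Image.open("photo.jpg").convert("RGB")

# Unconditional captioning: the caption is generated from the image alone.
inputs = processor(images=image, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=50)
print(processor.decode(out[0], skip_special_tokens=True))
```
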
**Florence 2 Large Ft** (andito; license: MIT; 93 downloads, 4 likes)
Florence-2 is an advanced vision foundation model developed by Microsoft, employing a prompt-based approach to handle various vision and vision-language tasks.
Tags: Image-to-Text, Transformers

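A minimal captioning sketch for Florence-2's prompt-based interface, shown here against the upstream microsoft/Florence-2-large-ft checkpoint rather than this mirror; assumes transformers with `trust_remote_code` enabled.

```python
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Florence-2-large-ft"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

image = Image.open("photo.jpg").convert("RGB")
task = "<DETAILED_CAPTION>"  # task token selects the captioning behaviour

inputs = processor(text=task, images=image, return_tensors="pt")
ids = model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=256,
)
raw = processor.batch_decode(ids, skip_special_tokens=False)[0]
result = processor.post_process_generation(raw, task=task, image_size=image.size)
print(result[task])
```

Swapping the task token to `<CAPTION>` or `<MORE_DETAILED_CAPTION>` changes the verbosity of the generated description.
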
**Paligemma Rich Captions** (gokaygokay; license: Apache-2.0; 66 downloads, 9 likes)
An image caption generation model fine-tuned from PaliGemma-3b on the DocCI dataset, capable of generating detailed descriptions of 200-350 characters with reduced hallucination.
Tags: Image-to-Text, Transformers, English

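A minimal captioning sketch assuming the model keeps the standard PaliGemma interface in transformers; the repository path and the `caption en` prompt prefix are assumptions carried over from the base model's conventions and may differ for this fine-tune.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

model_id = "gokaygokay/paligemma-rich-captions"  # path inferred from the listing
model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("photo.jpg").convert("RGB")
prompt = "caption en"  # base PaliGemma captioning prefix (assumed)

inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=192)
# Drop the prompt tokens so only the generated caption is decoded.
prompt_len = inputs["input_ids"].shape[1]
print(processor.decode(out[0][prompt_len:], skip_special_tokens=True))
```
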
**Spydazwebai Image Projectors** (LeroyDyer; 560 downloads, 1 like)
An image-to-text model based on the Transformers library, capable of converting image content into descriptive text, particularly suited to the art domain.
Tags: Image-to-Text, Supports Multiple Languages

**Uform Gen2 Qwen 500m** (unum-cloud; license: Apache-2.0; 17.98k downloads, 76 likes)
UForm-Gen is a small generative vision-language model used primarily for image caption generation and visual question answering.
Tags: Image-to-Text, Transformers, English

**Git Base One Piece** (ayoubkirouane; license: MIT; 16 downloads, 0 likes)
A vision-language model fine-tuned from Microsoft's git-base model, designed specifically to generate descriptive text captions for images from the anime 'One Piece'.
Tags: Image-to-Text, Transformers, Supports Multiple Languages

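A minimal sketch using the high-level transformers `image-to-text` pipeline; the repository path is inferred from the listing and may differ from the actual Hub id.

```python
from transformers import pipeline

# The pipeline wraps the GIT processor and model behind a single call.
captioner = pipeline("image-to-text", model="ayoubkirouane/git-base-One-Piece")

result = captioner("one_piece_frame.jpg", max_new_tokens=50)
print(result[0]["generated_text"])
```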